Interpretable machine learning models for predicting clinical pregnancies associated with surgical sperm retrieval from testes of different etiologies: a retrospective study

Background The relationship between surgical sperm retrieval for different etiologies and clinical pregnancy is unclear. We aimed to develop a robust and interpretable machine learning (ML) model for predicting clinical pregnancy, using SHapley Additive exPlanation (SHAP) to assess the contribution of surgical sperm retrieval from testes of different etiologies. Methods A total of 345 infertile couples who underwent intracytoplasmic sperm injection (ICSI) treatment with surgical sperm retrieval due to different etiologies from February 2020 to March 2023 at the reproductive center were retrospectively analyzed. Six ML models were used to predict the clinical pregnancy of ICSI. After evaluating the performance characteristics of the six ML models, the Extreme Gradient Boosting (XGBoost) model was selected as the best model, and SHAP was utilized to interpret the XGBoost model for predicting clinical pregnancies and to reveal the decision-making process of the model. Results Combining the area under the receiver operating characteristic curve (AUROC), accuracy, precision, recall, F1 score, Brier score, and the area under the precision-recall (P-R) curve (AP), the XGBoost model had the best performance (AUROC: 0.858, 95% confidence interval (CI): 0.778–0.936, accuracy: 79.71%, Brier score: 0.151). The global summary plot of SHAP values showed that female age was the most important feature influencing the model output. The SHAP plot showed that younger age in females, bigger testicular volume (TV), non-tobacco use, higher anti-müllerian hormone (AMH), lower follicle-stimulating hormone (FSH) in females, lower FSH in males, the temporary ejaculatory disorders (TED) group, and not the non-obstructive azoospermia (NOA) group all resulted in an increased probability of clinical pregnancy. Conclusions The XGBoost model predicts clinical pregnancies associated with testicular sperm retrieval of different etiologies with high accuracy, reliability, and robustness. It can provide clinical counseling decisions for patients with surgical sperm retrieval of various etiologies.
Supplementary Information The online version contains supplementary material available at 10.1186/s12894-024-01537-1.


Introduction
The prevalence of infertility is gradually increasing worldwide, with about 8-12% of couples suffering from infertility and 10-20% of infertile men suffering from azoospermia [1,2]. Decreased male fertility is often associated with testicular dysfunction, endocrine disruption, poor lifestyle, congenital developmental abnormalities, radiation, endocrine disruptor exposure, and aging, but 40% of male infertility is idiopathic and has a novel single gene linkage in the pathogenesis [3,4]. With increasing infertility, the use of assisted reproductive technologies has risen dramatically in the last decade, but the success rate of intracytoplasmic sperm injection (ICSI) depends on several factors, and in particular, has a significant relationship with sperm quality [5]. Azoospermia is categorized into obstructive azoospermia (OA) and nonobstructive azoospermia (NOA), and sperm are often obtained for ICSI in in vitro fertilization (IVF) centers using testicular sperm aspiration (TESA) and microdissection testicular sperm extraction (mTESE) [6]. TESA is a more commonly used procedure for extracting male sperm in assisted reproduction techniques than mTESE.
TESA is commonly used in cases where the male partner is unable to ejaculate or has no available high-quality sperm on the day of oocyte retrieval, including erectile dysfunction (ED), temporary ejaculatory disorders (TED), complete retrograde ejaculation, OA, necrospermia, and high sperm deoxyribonucleic acid (DNA) fragmentation [7][8][9]. mTESE is most commonly used for testicular sperm retrieval in NOA. The uncertainty of treatment outcomes in assisted reproductive technology and the variation in the quality of testicular spermatozoa obtained by surgical sperm retrieval for different etiologies affect clinical pregnancy with ICSI treatment to varying degrees. Therefore, there is a need to develop a predictive model to assess and interpret the clinical pregnancy outcomes of ICSI for counseling infertile couples.
Machine learning (ML) is a type of artificial intelligence that overcomes the limits of expert systems by replacing manually written rules with rules discovered automatically from data, allowing ML systems to learn from data and explain unknown situations [10]. More traditional logistic regression models have been used to predict IVF outcomes by female factors [11,12]. In contrast, fewer reports have addressed the prediction of clinical pregnancy in ICSI by male factors, and, to our knowledge, a model predicting clinical pregnancy in ICSI using testicular spermatozoa of different etiologies is reported here for the first time. This study aimed to develop six ML models for the prediction of clinical pregnancy. We selected the ML model with optimal performance and used ML visualization based on the SHapley Additive exPlanation (SHAP) to determine the contribution of surgical sperm retrieval for different etiologies in predicting clinical pregnancy [13].

Data source and study design
This was a retrospective study. In this study, 420 infertile couples who underwent surgical testicular sperm retrieval with ICSI for different etiologies at the IVF center of the Second Affiliated Hospital of Wenzhou Medical University between February 2020 and March 2023 were selected, of which 345 cases met the inclusion criteria. The present study protocol was reviewed and approved by the Institutional Review Board of the Second Affiliated Hospital of Wenzhou Medical University (approval No. 2022-K-196-01). All participants signed written informed consent.
Inclusion criteria included: ⑴ female age ≤ 40 years old, ⑵ ICSI and embryo transfer in fresh cycles, ⑶ informed consent has been obtained, ⑷ TESA or mTESE was performed in the male on the day of female ovum retrieval, ⑸ the male partner was not given medication affecting sperm before TESA or mTESE, ⑹ the female partner is free from reproductive and systemic diseases that significantly affect clinical pregnancy in ICSI, ⑺ stimulation of ovulation using a uniform follicular length program, and ⑻ no fertility-related genetic or chromosomal abnormalities in either partner. Exclusion criteria included: ⑴ important data missing, ⑵ no complete fresh IVF treatment was performed, ⑶ other ICSI treatment options, and ⑷ sexually transmitted diseases and psychogenic ED.

Groupings and definitions
According to the etiology underlying surgical sperm extraction on the day of ovum retrieval, the cases were divided into four groups: 103 cases in the NOA group, 81 cases in the OA group, 95 cases in the ED group, and 66 cases in the TED group. ED is defined according to the European Association of Urology guidelines as the persistent or recurrent inability, for at least 6 months, to achieve or maintain a penile erection sufficient for satisfactory sexual intercourse in response to appropriate sexual stimulation [14]. The diagnosis of ED relies on medical history data, the international index of erectile function 5 (IIEF-5), physical examination, color Doppler ultrasonography, blood parameters, and history of previous medications, and the patients with ED in this study had predominantly organic erectile dysfunction [15]. Azoospermia was defined according to the 2021 World Health Organization laboratory manual for the examination and processing of human semen (6th edition): 3 semen analyses were performed, and no sperm were observed at high magnification in at least 2 samples analyzed more than 2 weeks apart [16]. TESA can be used to distinguish OA from NOA, and mTESE is the gold standard for surgical sperm extraction in patients with NOA [17]. TED is a condition in which the patient can normally ejaculate through masturbation or sexual intercourse but fails to ejaculate on the day of female oocyte retrieval due to psychological factors. Female serum follicle-stimulating hormone (FSH) and anti-müllerian hormone (AMH) sampling and analysis were performed on days 2-3 of the menstrual period.

TESA and mTESE operating procedure
The TESA surgical approach was similar to that described by Cito G et al. [18]. The surgical field was disinfected and draped, and the patient was anesthetized by layer-by-layer infiltration from the scrotal skin to the tunica albuginea of the testis using 5 ml of lidocaine. The testis was slowly punctured using a sharp-pointed puncture needle with a side hole attached to a 10 ml syringe, and the testicular tissue was obtained by negative-pressure suction and quickly placed into a petri dish, which was passed to the embryologists for the ICSI operation.
The mTESE surgical approach was similar to that described by Jensen C et al. [6]. The surgeon used a scalpel to make a transverse incision in the anterior middle of the scrotum, extruded the testis, and incised the underlying scrotal layers to expose the tunica albuginea of the testis. The testis was fixed with the left hand. A 1-cm transverse incision was made in the tunica albuginea at the anterior middle of the testis under 6x surgical magnification to reveal the testicular tissue. The blood vessels were electrocauterized to stop bleeding. Thick, white seminiferous tubules were searched for under a 25x surgical microscope. The tubules that may contain spermatozoa were removed with ophthalmic scissors and examined under a microscope for spermatozoa. After sufficient spermatozoa were obtained, hemostasis was secured and the wound was closed.

Controlled ovarian stimulation (COS), ICSI process, and definition of the label
The COS and ICSI operating procedures were similar to those described by Bedenk J et al. [19]. One or two blastocyst-stage embryos were transferred with luteal support per cycle. Serum human chorionic gonadotropin (HCG) ≥ 15 IU/L measured on the 14th day after transfer was considered positive for HCG, and the detection of a gestational sac on ultrasound on the 30th day after transfer was considered positive for clinical pregnancy.

Feature engineering
Based on previous studies [20,21] and expert opinions (3 independent specialists in andrology and reproductive medicine from the Second Affiliated Hospital of Wenzhou Medical University), we developed an initial clinical pregnancy prediction model with 22 candidate independent variables (Supplementary Material 1). To reduce multicollinearity and remove redundant features, we processed the data using Recursive Feature Elimination (RFE), which removed 1 feature (HCG) [22]. We used the ML-based Random Forest (RF) algorithm (missForest R package) to impute features with less than 10% missing values [23]. The final models for predicting clinical pregnancy included 21 features (4 categorical variables and 17 continuous variables) and 1 label. We used MinMaxScaler to normalize the continuous features and one-hot encoding for the categorical features [24]. There was no significant class imbalance in the label of this study; therefore, no Synthetic Minority Oversampling Technique (SMOTE) processing of the data was required [25]. We randomly split all the data in a ratio of 80:20, where 80% was used as a training set to train the models and 20% was used as a test set to test the models.
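The preprocessing steps named above (min-max normalization, one-hot encoding, and an 80:20 random split) can be sketched in plain Python. This is only an illustration under stated assumptions: the study used scikit-learn, whereas the helper functions, the seed, and the toy ages below are hypothetical, and fitting the scaler on the training portion only is an assumption that the text does not state.

```python
import random

def min_max_scale(train_col, test_col):
    """Scale values to [0, 1] using the training column's min and max only."""
    lo, hi = min(train_col), max(train_col)
    span = (hi - lo) or 1.0  # guard against a constant column
    return ([(v - lo) / span for v in train_col],
            [(v - lo) / span for v in test_col])

def one_hot(value, categories):
    """Encode one categorical value as a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

# Etiology groups as named in the text; all numeric values are toy data.
GROUPS = ["NOA", "OA", "ED", "TED"]
train_idx, test_idx = train_test_split_indices(345)
train_age, test_age = min_max_scale([24, 30, 36, 40], [28, 42])
encoded = one_hot("TED", GROUPS)

print(len(train_idx), len(test_idx))
print(train_age, test_age)
print(encoded)
```

With 345 couples, an 80:20 split leaves 276 rows for training and 69 for testing. A test value outside the training range (42 in the toy data) scales to a value above 1, which is expected when the scaler is fitted on the training rows alone.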

Predictive modeling strategy
We used six ML algorithms, including k-nearest neighbor (KNN), support vector machine (SVM), RF, categorical boosting (CatBoost), extreme gradient boosting (XGBoost), and gradient boosting decision tree (GBDT), to develop the prediction models. We used a grid search with 5-fold cross-validation to tune the hyperparameters and obtain the combination that gave the best model performance in predicting clinical pregnancy. Because different ML algorithms suit different data characteristics, we trained six different ML models to predict clinical pregnancies to test the reliability, accuracy, and robustness of the models. We evaluated the performance and robustness of the predictive models by calculating the area under the receiver operating characteristic curve (AUROC), accuracy, precision, recall, F1 score, Brier score, and the area under the precision-recall (P-R) curve (AP) for each ML model. After comparing the discriminative performance of each ML model for predicting clinical pregnancy, the model with the best AUROC was selected as the optimal model for predicting clinical pregnancy, and the decision-making process of the model was interpreted using the SHAP.
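The tuning procedure described above, a grid search scored by 5-fold cross-validation, can be illustrated with a minimal stdlib sketch. It is not the study's implementation: a tiny KNN classifier stands in for the six algorithms, the only hyperparameter searched is the number of neighbors, accuracy is used as the selection score, and the data are synthetic. All function names are hypothetical.

```python
import random

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0

def k_fold_indices(n, n_folds=5, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def cv_accuracy(X, y, k, folds):
    """Mean held-out accuracy over the folds for one hyperparameter value."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        tr = [i for i in range(len(X)) if i not in held]
        tr_X, tr_y = [X[i] for i in tr], [y[i] for i in tr]
        correct = sum(knn_predict(tr_X, tr_y, X[i], k) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def grid_search(X, y, grid, n_folds=5):
    """Score every candidate on the same folds and keep the best one."""
    folds = k_fold_indices(len(X), n_folds)
    results = {k: cv_accuracy(X, y, k, folds) for k in grid}
    best_k = max(results, key=results.get)
    return best_k, results

# Synthetic two-feature data with a simple linear class boundary.
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(100)]
y = [1 if a + b > 1.0 else 0 for a, b in X]
best_k, results = grid_search(X, y, grid=[1, 3, 5, 7])
print(best_k, results)
```

Every candidate is scored on the same folds, so differences in the cross-validated score reflect the hyperparameter rather than the split. In practice the search would run on the training set only, with the held-out test set reserved for the final comparison.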

Interpretation of the model using the SHAP
The SHAP generates a SHAP value for each feature of the ML model to quantify the feature's contribution to the clinical pregnancy prediction, with a positive SHAP value indicating that the feature pushes the prediction toward clinical pregnancy and a negative SHAP value indicating that it pushes the prediction away from it [26]. The SHAP summary plot provides a direct view of the importance of each feature and the contribution of each feature to the output of the ML model, while the SHAP force plot provides a visual understanding of how the ML model makes decisions about clinical pregnancy prediction [27].
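To make the sign and additivity of SHAP values concrete, the sketch below computes exact Shapley values for a three-feature toy scoring function by enumerating feature subsets. It is a simplified illustration, not the SHAP library's tree algorithm: absent features are replaced by a single reference row, and the scoring function, its coefficients, and the input values are all hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x; features outside a coalition
    take their values from one background (reference) row."""
    n = len(x)

    def value(subset):
        row = [x[i] if i in subset else background[i] for i in range(n)]
        return predict(row)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                s = set(s)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_score(row):
    """Hypothetical linear score; NOT the fitted XGBoost model."""
    female_age, amh, noa = row
    return 0.6 - 0.01 * (female_age - 30) + 0.02 * amh - 0.15 * noa

x = [27, 4.0, 0]           # case to explain (toy values)
background = [33, 2.5, 1]  # reference case (toy values)
phi = shapley_values(toy_score, x, background)
base = toy_score(background)
print(base, phi, toy_score(x))
```

In this toy case all three values are positive (about 0.06, 0.03, and 0.15), because the explained case is younger, has higher AMH, and is not in the NOA group relative to the reference. The base value plus the sum of the Shapley values equals the score for the explained case, which is the additivity property that summary and force plots rely on.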

Statistical analysis
Continuous variables with a normal distribution were expressed as mean ± standard deviation (SD); otherwise, they were expressed as median and interquartile range, and comparisons between groups were made using the Mann-Whitney U test. Categorical variables were expressed as frequencies (percentages) and compared using Pearson's Chi-square test or Fisher's exact test, and a two-sided P < 0.05 was considered statistically significant. Statistical analysis of data between groups was performed using R 4.3.1. The R packages used were tidyverse, haven, gtsummary, MASS, missForest, and caret. ML models were analyzed using Python 3.11 with scikit-learn 1.3.0.

Baseline characteristics of participants
Figure 1 illustrates the participant screening and study design process for this study. Table 1 summarizes the baseline characteristics of the participants. Age, body mass index (BMI), tobacco use, FSH, Irisin, Nesfatin-1, NOA group, and ED group were significantly lower in the male participants in the clinical pregnancy group than in the non-clinical pregnancy group, with statistically significant differences between groups (P < 0.05). In contrast, the testicular volume (TV), total testosterone (TT), Inhibin B, Johnsen score, and TED group of male participants in the clinical pregnancy group were significantly higher than those in the non-clinical pregnancy group, with statistically significant differences between groups (P < 0.05). The BMI, age, and FSH of the female participants in the clinical pregnancy group were significantly lower than those in the non-clinical pregnancy group, and there was a statistically significant difference between the groups (P < 0.05). In contrast, the antral follicle count (AFC) and AMH of female participants in the clinical pregnancy group were significantly higher than those in the non-clinical pregnancy group, with statistically significant differences between groups (P < 0.05). In addition, we separately counted the baseline characteristics of the OA, NOA, and TED groups. In the NOA group, the males had a TV (ml): 10 ( 9

Performance comparison of ML models for predicting clinical pregnancy
Table 2 summarizes the performance of the six ML models in predicting clinical pregnancy. Among them, the XGBoost model best predicted clinical pregnancy, with an AUROC of 0.858, a 95% confidence interval (CI) of 0.778-0.936, and an AP of 0.810 (Figs. 2 and 3). The accuracy of the XGBoost model was 79.71% with an F1 score of 0.731, and a comprehensive analysis based on the performance of the six ML models showed that XGBoost had the highest accuracy and robustness in predicting clinical pregnancy (Fig. 4).
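For readers less familiar with the reported metrics, the sketch below computes accuracy, precision, recall, F1 score, Brier score, and AUROC from labels and predicted probabilities. It is a stdlib illustration on eight made-up cases, not the study's evaluation code; the 0.5 decision threshold and the function names are assumptions, and AP is omitted for brevity.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

def auroc(y_true, y_prob):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def summarize(y_true, y_prob, threshold=0.5):
    """Collect the metrics; assumes at least one predicted and one true positive."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "brier": brier_score(y_true, y_prob),
        "auroc": auroc(y_true, y_prob),
    }

# Made-up labels and probabilities, not study data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
metrics = summarize(y_true, y_prob)
print(metrics)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUROC and the Brier score are computed from the probabilities directly; a lower Brier score indicates better-calibrated probabilities.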

SHAP feature importance and individual decision-making of the XGBoost model
We used SHAP global summary plots to visualize the effect of each feature in the XGBoost model on the prediction of clinical pregnancy in the test dataset. The SHAP feature importance ranking suggests that the age of the female is the most important feature in the XGBoost model for predicting clinical pregnancy. The SHAP plot showed that younger age of women, higher TV, non-tobacco use, higher AMH, lower FSH in females, lower FSH in males, the TED group, and the non-NOA group all resulted in an increased probability of clinical pregnancy (Fig. 5).

Discussion
We successfully trained and tested six ML predictive models for clinical pregnancy in ICSI with surgical testicular sperm retrieval for different etiologies. We found that the XGBoost model performed the best and was selected for predicting clinical pregnancies. The average AUROC of the XGBoost model in the test data was 0.858 (95% CI: 0.778-0.936), representing excellent model efficiency and robustness. We used SHAP global summary plots to show the importance of each feature in the XGBoost model, suggesting that female age is the most important feature. The probability of pregnancy was higher in patients with younger female age, bigger TV, non-tobacco use, higher AMH, lower FSH in females, lower FSH in males, TED, and non-NOA. Our SHAP force plot clearly shows how the XGBoost model makes individual decisions about non-clinical or clinical pregnancy and how each feature contributes to the predicted outcome, allowing the users to trust and understand the model more. To the best of our knowledge, this is the first study to apply ML methods to predict clinical pregnancy in ICSI using testicular spermatozoa of different etiologies. The use of testicular sperm for ICSI has become the mainstay of treatment at IVF centers for infertile couples who are unable to provide their semen and a sufficient amount of competent sperm on the day of oocyte retrieval. Not all infertile couples are successful in achieving a clinical pregnancy with this treatment option. For this reason, there is an urgent need for infertile couples to understand the possible clinical pregnancy outcomes of ICSI using testicular sperm to inform their weighing of acceptable treatment risks and costs.
Therefore, the ML clinical pregnancy prediction model we developed using testicular spermatozoa for different etiologies has important clinical applications. Song J et al. used different types of acquired, idiopathic, and congenital azoospermia for predicting clinical pregnancy outcomes [28]. They constructed a logistic regression prediction model, which showed that bigger TV, higher testosterone levels, younger age of the woman, bigger AFC, and higher AMH were associated with a higher probability of clinical pregnancy. This is consistent with our findings suggesting that better ovarian and testicular function are associated with successful clinical pregnancy. We used an interpretable ML approach to construct a clinical pregnancy prediction model with more powerful data analysis capabilities than traditional logistic regression and effectively improved the accuracy and robustness of predicting clinical pregnancy. The SHAP summary plot we plotted suggests that female age is the most important model feature. Similar to our findings, Kato K et al. constructed an IVF clinical pregnancy prediction model using only female age and embryo developmental rate as independent variables, and their study showed that female age has a strong relationship with IVF clinical pregnancy [29]. Tsafrir A et al. studied clinical pregnancy in IVF using frozen oocytes and showed a significant negative correlation between female age and oocyte quality [30]. This suggests that female age affects the final clinical pregnancy outcome by influencing oocyte quality. Li F et al. studied and constructed a clinical pregnancy prediction model for females with poor ovarian response to IVF and ICSI treatment, suggesting that women older than 35 years, with higher BMI, and higher basal FSH were associated with lower clinical pregnancy rates [31]. This is in general consistent with our findings suggesting that older female age, higher BMI, and higher basal FSH are negatively associated with clinical pregnancy.
In addition to female factors being associated with clinical pregnancy outcomes, male factors are also strongly implicated. Our findings showed that the TED group had a higher clinical pregnancy rate than the other groups, while the NOA group had the lowest clinical pregnancy rate, and we hypothesized that it might be related to the fact that the TED group had higher sperm quality, while the NOA group had poorer sperm quality. Our testicular histopathologic analysis also confirmed that the NOA group had a lower Johnsen score than the other groups. Similar conclusions were reached by Aboukhshaba A et al., who concluded that the fertilization and live birth rates of NOA testicular spermatozoa obtained by mTESE were relatively low [32]. We counted the sperm parameters of the TED group before the ICSI cycle and showed that sperm concentration, progressive motility, sperm normal forms, and sperm DNA fragmentation index were within the normal reference ranges, which represented a high sperm quality in the TED group. Our data showed that the ratio of TED patients to total ICSI cycles was 19.13%, which may be related to the influence of lifestyle habits, the environment in which spermatozoa are obtained, and multiple psychological factors. Zhang X et al. showed that nearly 10% of males in the total cycle of IVF and ICSI had TED, which was related to the fact that males undergoing IVF treatment were often worried about abnormal semen results, the outcome of IVF treatment, socio-economic pressures, and psychiatric disorders such as anxiety and depression, which were very common among males undergoing IVF treatment [33]. Similar views were expressed by Wang J et al., who investigated and found that the prevalence of male TED on the day of female ovum acquisition was 8.3% [34]. According to epidemiologic surveys, up to 52% of diabetic men experience ED, and 35-50% have ejaculatory disorders. We hypothesized that the effect of the ED group on clinical pregnancy may be related to disturbances in glucose metabolism, which may affect sperm quality [35,36]. Service CA et al. made a similar point, suggesting that male obesity, diabetes, and metabolic syndrome negatively affect all sperm parameters and that male obesity, diabetes, and metabolic syndrome negatively correlate with clinical pregnancy and live birth rates, whether conceived through natural fertilization or assisted reproductive technology [37].
We trained and tested six ML models and chose the XGBoost model as the optimal model to predict clinical pregnancies associated with surgical sperm retrieval of different etiologies. This study has several strengths and one main limitation. First, we used the SHAP to interpret the XGBoost model for predicting clinical pregnancy, making it easier for users to understand the model's decision-making process and trust the model. Second, we utilized the powerful data analysis capabilities of ML to construct clinical pregnancy prediction models with higher efficiency, accuracy, and robustness than traditional models. Third, the clinical pregnancy prediction model we developed for using testicular spermatozoa with different etiologies can provide valuable clinical counseling strategies for such infertile couples. Fourth, as a limitation, the sample size of this study is relatively small and needs to be increased, which is determined by the prevalence of the disease.

Conclusions
Our developed XGBoost model for predicting clinical pregnancy based on surgical retrieval of testicular spermatozoa from different etiologies has high accuracy, efficiency, and robustness. Our use of the SHAP makes the XGBoost model more interpretable and can provide accurate, efficient, and practical clinical counseling decisions for such infertile couples. We will further extend the model features and update the training and validation data of the model to generalize the applicability of this clinical pregnancy prediction model to IVF centers in different countries and further improve the interpretability of the XGBoost model.

Fig. 1 Participant screening and study design flowchart

Fig. 6 Individual force plots for (A) non-clinical pregnancy and (B) clinical pregnancy. The horizontal axis in the force plot represents the predicted value of clinical pregnancy probability f(x) and is labeled with the base value, and the SHAP values labeled below indicate the contribution of each feature in the XGBoost model to the prediction of clinical pregnancy. Red arrows on the left side indicate features that increase the probability of clinical pregnancy, blue arrows on the right side indicate features that decrease the probability of clinical pregnancy, and the length of the arrows indicates the degree to which each feature contributes to the prediction of clinical pregnancy. The red and blue intersections represent individual predictions from the XGBoost model, with 0 indicating a non-clinical pregnancy and 1 indicating a clinical pregnancy.

Table 1 Baseline characteristics of participants